{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 文生图数据集\n",
    "\n",
    "千帆平台支持用户上传文生图数据集，并使用文生图数据集对模型进行训练。\n",
    "\n",
    "本篇教程将会介绍，如何在本地创建一个数据集，并且将该数据集上传至千帆平台，以供后续操作\n",
    "\n",
    "# 前置准备\n",
    "\n",
    "在开始之前，首先请将千帆 Python SDK 更新至最新版本"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "vscode": {
     "languageId": "shellscript"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      
     ]
    }
   ],
   "source": [
    "pip install -U qianfan"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "并且在环境变量中设置好 Access Key 与 Secret Key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "import os\n",
    "\n",
    "from qianfan.utils import enable_log\n",
    "\n",
    "os.environ['QIANFAN_ACCESS_KEY'] = 'your_access_key'\n",
    "os.environ['QIANFAN_SECRET_KEY'] = 'your_secret_key'\n",
    "your_bos_storage_id = \"your_bos_storage_id\"\n",
    "your_bos_storage_path = \"your_bos_storage_path\"\n",
    "\n",
    "# 选择打印出来的日志等级，目前打印出 info 级别\n",
    "enable_log(logging.INFO)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 正文\n",
    "\n",
    "千帆平台使用的文生图数据集采用文件夹的形式组织。一个数据集中，包含若干个后缀名为 (jpg/jpeg/bmp/png) 的图片文件，以及若干个后缀名为 json 的 json 文件\n",
    "\n",
    "其中，需要关注的是 json 文件。每个 json 文件都是一个 json 字典对象，其中仅包含 `prompt` 字段，用于标注与 json 文件同名的图片文件的内容。\n",
    "\n",
    "一个用于描述 \"狗在草坪上打滚\"，名为 `example.json` 的 json 文件，其内容可以是:\n",
    "```json\n",
    "{\n",
    "    \"prompt\": \"一只狗在草坪上打滚\"\n",
    "}\n",
    "```\n",
    "\n",
    "一个图片可以不包含对应的 json 文件，但若包含，则必须保证其文件名与图片文件名相同。\n",
    "\n",
    "## 使用千帆 Python SDK 读取\n",
    "\n",
    "千帆 Python SDK 中提供的 `Dataset` 类支持用户读取一个文生图数据集，并返回一个包含了相关信息的 `Dataset`  对象"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] [03-19 16:30:44] dataset.py:880 [t:8094817088]: list local dataset data by None\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/8.jpg', 'annotation': None}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/9.jpg', 'annotation': None}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/4.jpg', 'annotation': {'prompt': '桌子上有一些白色小棒、两块布、一把剪刀。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/5.jpg', 'annotation': {'prompt': '一个男人和一个女人在一张桌子上用电脑。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/7.jpg', 'annotation': {'prompt': '一个戴眼镜的男人旁边有一只黑猫。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/6.jpg', 'annotation': {'prompt': '一群长颈鹿站在一棵大树边。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/2.jpg', 'annotation': {'prompt': '一个装满早餐的盘子两侧放着刀叉。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/3.jpg', 'annotation': {'prompt': '桌子上面有一份咖啡和一个甜甜圈。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/1.jpg', 'annotation': {'prompt': '黑白照片上一个人站在放满东西的长椅旁边。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/0.jpg', 'annotation': {'prompt': '飞机上的一排座位空空的。'}}]\n"
     ]
    }
   ],
   "source": [
    "from qianfan.dataset import Dataset\n",
    "from qianfan.dataset.data_source import FileDataSource\n",
    "from qianfan.dataset.data_source.base import FormatType\n",
    "\n",
    "file_data_source = FileDataSource(path=\"data_file/text2img_example_data\", file_format=FormatType.Text2Image)\n",
    "\n",
    "ds = Dataset.load(file_data_source)\n",
    "\n",
    "print(ds.list())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "用户在读取进来之后，可以手动对数据集中的内容进行修改，例如我们需要过滤掉所有不存在标注的图片："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] [03-19 16:30:44] dataset.py:880 [t:8094817088]: list local dataset data by None\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/4.jpg', 'annotation': {'prompt': '桌子上有一些白色小棒、两块布、一把剪刀。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/5.jpg', 'annotation': {'prompt': '一个男人和一个女人在一张桌子上用电脑。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/7.jpg', 'annotation': {'prompt': '一个戴眼镜的男人旁边有一只黑猫。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/6.jpg', 'annotation': {'prompt': '一群长颈鹿站在一棵大树边。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/2.jpg', 'annotation': {'prompt': '一个装满早餐的盘子两侧放着刀叉。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/3.jpg', 'annotation': {'prompt': '桌子上面有一份咖啡和一个甜甜圈。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/1.jpg', 'annotation': {'prompt': '黑白照片上一个人站在放满东西的长椅旁边。'}}, {'image_path': '/Users/pengyiyang/Desktop/github/bce-qianfan-sdk/cookbook/dataset/data_file/text2img_example_data/0.jpg', 'annotation': {'prompt': '飞机上的一排座位空空的。'}}]\n"
     ]
    }
   ],
   "source": [
    "ds = ds.filter(lambda x: x[\"annotation\"] is not None)\n",
    "\n",
    "print(ds.list())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 上传到千帆\n",
    "\n",
    "这些数据集最终还是需要上传到千帆，以供后续在平台上进行的训练。千帆 SDK 也为用户提供了相应的接口，来帮助用户上传"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] [03-19 16:30:44] baidu_qianfan.py:451 [t:8094817088]: start to create dataset on qianfan\n",
      "[INFO] [03-19 16:30:44] baidu_qianfan.py:469 [t:8094817088]: create dataset on qianfan successfully\n",
      "[INFO] [03-19 16:30:44] baidu_qianfan.py:237 [t:8094817088]: start to upload data to user BOS\n",
      "[INFO] [03-19 16:30:45] baidu_qianfan.py:249 [t:8094817088]: uploading data to user BOS finished\n",
      "[INFO] [03-19 16:30:46] utils.py:476 [t:8094817088]: successfully create importing task\n",
      "[INFO] [03-19 16:30:48] utils.py:479 [t:8094817088]: polling import task status\n",
      "[INFO] [03-19 16:30:48] utils.py:486 [t:8094817088]: import status: 1, keep polling\n",
      "[INFO] [03-19 16:30:50] utils.py:479 [t:8094817088]: polling import task status\n",
      "[INFO] [03-19 16:30:51] utils.py:486 [t:8094817088]: import status: 1, keep polling\n",
      "[INFO] [03-19 16:30:53] utils.py:479 [t:8094817088]: polling import task status\n",
      "[INFO] [03-19 16:30:53] utils.py:486 [t:8094817088]: import status: 1, keep polling\n",
      "[INFO] [03-19 16:30:55] utils.py:479 [t:8094817088]: polling import task status\n",
      "[INFO] [03-19 16:30:56] utils.py:489 [t:8094817088]: import succeed\n"
     ]
    }
   ],
   "source": [
    "from qianfan.dataset.data_source import QianfanDataSource\n",
    "from qianfan.resources.console.consts import DataStorageType, DataTemplateType\n",
    "\n",
    "from qianfan.utils.utils import generate_letter_num_random_id\n",
    "\n",
    "qianfan_data_source = QianfanDataSource.create_bare_dataset(\n",
    "    f\"text_to_image{generate_letter_num_random_id(6)}\",\n",
    "    DataTemplateType.Text2Image,\n",
    "    DataStorageType.PrivateBos,\n",
    "    your_bos_storage_id,\n",
    "    your_bos_storage_path,\n",
    ")\n",
    "\n",
    "qianfan_ds = ds.save(qianfan_data_source)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 上传到 Bos\n",
    "\n",
    "千帆 Python SDK 也提供了上传到百度智能云云对象存储（BOS）的功能。用户可根据自身需要进行选择"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] [03-19 16:30:56] bos.py:117 [t:8094817088]: start to upload file\n",
      "[INFO] [03-19 16:30:56] bos_uploader.py:91 [t:8094817088]: check if bos file existed\n"
     ]
    }
   ],
   "source": [
    "from qianfan.dataset.data_source import BosDataSource\n",
    "\n",
    "bos_data_source = BosDataSource(\n",
    "    region=\"bj\",\n",
    "    bucket=your_bos_storage_id,\n",
    "    bos_file_path=your_bos_storage_path + f\"text_to_image_dataset/ds_{generate_letter_num_random_id()}\",\n",
    "    file_format=FormatType.Text2Image,\n",
    ")\n",
    "\n",
    "bos_ds = ds.save(bos_data_source)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 从千帆和 Bos 下载数据集到本地\n",
    "\n",
    "除了将数据集上传到千帆和 Bos 之外，用户还可以将上面的数据集下载到本地，方便进行其它操作。以刚才我们上传的数据集为例子进行演示："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[INFO] [03-19 16:30:57] dataset.py:462 [t:8094817088]: no destination data source was provided, construct\n",
      "[INFO] [03-19 16:30:57] dataset.py:257 [t:8094817088]: construct a file data source from path: t2i_from_qianfan, with args: {'file_format': <FormatType.Text2Image: 'text2image'>}\n",
      "[INFO] [03-19 16:30:57] baidu_qianfan.py:345 [t:8094817088]: no cache was found, download cache\n",
      "[INFO] [03-19 16:30:57] baidu_qianfan.py:271 [t:8094817088]: get dataset info succeeded for dataset id ds-mt06fsm54cram18v\n",
      "[INFO] [03-19 16:30:57] utils.py:610 [t:8094817088]: start to export dataset\n",
      "[INFO] [03-19 16:30:58] utils.py:614 [t:8094817088]: create dataset export task successfully\n",
      "[INFO] [03-19 16:31:00] utils.py:619 [t:8094817088]: polling export task status\n",
      "[INFO] [03-19 16:31:01] utils.py:627 [t:8094817088]: export status: 1, keep polling\n",
      "[INFO] [03-19 16:31:03] utils.py:619 [t:8094817088]: polling export task status\n",
      "[INFO] [03-19 16:31:03] utils.py:627 [t:8094817088]: export status: 1, keep polling\n",
      "[INFO] [03-19 16:31:05] utils.py:619 [t:8094817088]: polling export task status\n",
      "[INFO] [03-19 16:31:05] utils.py:627 [t:8094817088]: export status: 1, keep polling\n",
      "[INFO] [03-19 16:31:07] utils.py:619 [t:8094817088]: polling export task status\n",
      "[INFO] [03-19 16:31:08] utils.py:627 [t:8094817088]: export status: 1, keep polling\n",
      "[INFO] [03-19 16:31:10] utils.py:619 [t:8094817088]: polling export task status\n",
      "[INFO] [03-19 16:31:10] utils.py:627 [t:8094817088]: export status: 1, keep polling\n",
      "[INFO] [03-19 16:31:12] utils.py:619 [t:8094817088]: polling export task status\n",
      "[INFO] [03-19 16:31:13] utils.py:624 [t:8094817088]: export succeed\n",
      "[INFO] [03-19 16:31:13] utils.py:558 [t:8094817088]: get export records succeeded for dataset id ds-mt06fsm54cram18v\n",
      "[INFO] [03-19 16:31:13] utils.py:572 [t:8094817088]: latest dataset with time2024-03-19 16:31:13 for dataset ds-mt06fsm54cram18v\n",
      "[INFO] [03-19 16:31:14] dataset.py:195 [t:8094817088]: change local file format FormatType.Text2Image to qianfan file format FormatType.Text2Image\n",
      "[INFO] [03-19 16:31:14] dataset.py:462 [t:8094817088]: no destination data source was provided, construct\n",
      "[INFO] [03-19 16:31:14] dataset.py:257 [t:8094817088]: construct a file data source from path: t2i_dataset_from_bos, with args: {'file_format': <FormatType.Text2Image: 'text2image'>}\n",
      "[INFO] [03-19 16:31:14] bos.py:292 [t:8094817088]: cache was outdated, start to update bos cache\n"
     ]
    }
   ],
   "source": [
    "t2i_dataset_from_qianfan = Dataset.load(qianfan_data_source).save(data_file=\"t2i_from_qianfan\", file_format=FormatType.Text2Image)\n",
    "\n",
    "t2i_dataset_from_bos = Dataset.load(bos_data_source).save(data_file=\"t2i_dataset_from_bos\", file_format=FormatType.Text2Image)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "bce-qianfan-sdk-new",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
